High-accuracy splice site prediction based on sequence component and position features.

Authors

  • J L Li
  • L F Wang
  • H Y Wang
  • L Y Bai
  • Z M Yuan
Abstract

Identification of splice sites plays a key role in the annotation of genes. Consequently, improvement of computational prediction of splice sites would be very useful. We examined the effect of the window size and the number and position of the consensus bases with a chi-square test, and then extracted the sequence multi-scale component features and the position and adjacent position relationship features of consensus sites. Then, we constructed a novel classification model using a support vector machine with the previously selected features and applied it to the Homo sapiens splice site dataset. This method greatly improved cross-validation accuracies for training sets with true and spurious splice sites of both equal and different proportions. This method was also applied to the NN269 dataset for further evaluation and independent testing. The results were superior to those obtained with previous methods, and demonstrate the stability and superiority of this method for prediction of splice sites.
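As a rough illustration of two steps named in the abstract, the sketch below scores each aligned position with a chi-square statistic (true versus spurious sites) and builds multi-scale k-mer frequency features. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `position_chi_square` and `kmer_frequencies`, the choice of k = 1 to 3, and the toy sequences are all hypothetical, and the support vector machine step is omitted.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def position_chi_square(true_seqs, false_seqs):
    """Chi-square statistic per aligned position for a 2 x 4 (class x base) table."""
    length = len(true_seqs[0])
    scores = []
    for pos in range(length):
        # Observed base counts at this position, one row per class.
        observed = [
            Counter(s[pos] for s in true_seqs),
            Counter(s[pos] for s in false_seqs),
        ]
        row_totals = [sum(row[b] for b in ALPHABET) for row in observed]
        col_totals = {b: observed[0][b] + observed[1][b] for b in ALPHABET}
        grand = sum(row_totals)
        chi2 = 0.0
        for row, row_total in zip(observed, row_totals):
            for b in ALPHABET:
                expected = row_total * col_totals[b] / grand
                if expected > 0:  # skip bases absent from both classes
                    chi2 += (row[b] - expected) ** 2 / expected
        scores.append(chi2)
    return scores


def kmer_frequencies(seq, max_k=3):
    """Concatenated k-mer frequencies for k = 1..max_k (4 + 16 + 64 values when max_k is 3)."""
    features = []
    for k in range(1, max_k + 1):
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        total = max(len(seq) - k + 1, 1)
        for kmer in product(ALPHABET, repeat=k):
            features.append(counts["".join(kmer)] / total)
    return features


# Toy, made-up windows around a GT dinucleotide (assumption, not real data).
true_sites = ["AAGGTAAG", "CAGGTGAG", "AAGGTAAG", "GAGGTAAG"]
false_sites = ["ACGGTCCA", "TTAGTGCA", "GCTGTACC", "CATGTTTG"]

scores = position_chi_square(true_sites, false_sites)
features = kmer_frequencies(true_sites[0])
```

Positions where both classes share the same base (here the GT dinucleotide itself) score zero, while flanking positions that differ between classes score higher. In a fuller pipeline, such scores could guide the choice of window size and informative positions before the features are passed to a classifier.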


Similar resources

Evaluating the Accuracy of Splice Site Prediction based on Integrating Jensen-Shannon Divergence and a Polynomial Equation of Order 2

Advances in DNA sequencing technology have generated a vast amount of new sequence data. It is essential to understand the functions, features, and structures of all newly sequenced data. Analyzing sequence data by different methods could provide important information about the sequence data. One of the essential tasks for genome annotation is gene prediction that can help to und...


A Novel Splice Site Prediction Method using Support Vector Machine

We present a novel classification method for splice sites prediction using support vector machine (SVM). The method first represents input sequences by sequence-based features, including the information of the distribution of tri-nucleotides and the conserved features surrounding the splice sites characterized by Markov model. An F-score based feature selection method is then used to select inf...


Modelling splice sites with locality-sensitive sequence features

The splice sites are essential for pre-mRNA maturation and crucial for Splice Site Modelling (SSM); however, there are gaps between the splicing signals and the computationally identified sequence features. In this paper, the Locality Sensitive Features (LSFs) are proposed to reduce the gaps by homogenising their contexts. Under the skewness-kurtosis based statistics and data analysis, SSM attr...


SplicePort—An interactive splice-site analysis tool

SplicePort is a web-based tool for splice-site analysis that allows the user to make splice-site predictions for submitted sequences. In addition, the user can also browse the rich catalog of features that underlies these predictions, and which we have found capable of providing high classification accuracy on human splice sites. Feature selection is optimized for human splice sites, but the se...


Hidden Markov Model for Splicing Junction Sites Identification in DNA Sequences

Identification of coding sequence from genomic DNA sequence is the major step in pursuit of gene identification. In the eukaryotic organism, gene structure consists of promoter, intron, start codon, exons and stop codon, etc. and to identify it, accurate labeling of the mentioned segments is necessary. Splice site is the ‘separation’ between exons and introns, the predicted accuracy of which is...



Journal:
  • Genetics and molecular research : GMR

Volume 11, Issue 3

Pages: -

Publication date: 2012